Revisiting stochastic off-policy action-value gradients

Authors

  • Yemi Okesanjo
  • Victor Kofia

Abstract

Off-policy stochastic actor-critic methods rely on approximating the stochastic policy gradient in order to derive an optimal policy. One may also derive the optimal policy by approximating the action-value gradient. The use of action-value gradients is desirable because policy improvement then occurs along the direction of steepest ascent. This has been studied extensively in the context of natural-gradient actor-critic algorithms and, more recently, deterministic policy gradients. In this paper we briefly discuss the off-policy stochastic counterpart to deterministic action-value gradients, as well as an incremental approach for following the policy gradient in lieu of the natural gradient.
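
For orientation, the deterministic action-value gradient referred to here is the off-policy deterministic policy gradient of Silver et al. (2014), and a stochastic counterpart can be written by reparameterizing the policy. The notation below is a sketch under assumed conventions (\rho^{\beta} is the behaviour state distribution, a = f_\theta(s, \epsilon) an assumed reparameterization), not the paper's own derivation:

\nabla_\theta J(\mu_\theta) \approx \mathbb{E}_{s \sim \rho^{\beta}} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right]

\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^{\beta}, \, \epsilon \sim \mathcal{N}(0, I)} \left[ \nabla_\theta f_\theta(s, \epsilon) \, \nabla_a Q^{\pi}(s, a) \big|_{a = f_\theta(s, \epsilon)} \right]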

Related Papers

Expected Policy Gradients

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and de...
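
Concretely (a sketch with assumed notation, where \hat{Q} is the critic), EPG replaces the single-sample score-function estimator \nabla_\theta \log \pi_\theta(a_t|s_t) \, \hat{Q}(s_t, a_t) with an integral over the action space:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}} \left[ \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a|s) \, \hat{Q}(s, a) \, da \right]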

Policy gradients in linearly-solvable MDPs

We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy g...
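
For background (Todorov's linearly-solvable MDP formulation, summarized under assumed notation rather than taken from this abstract): with passive dynamics \bar{p}(s'|s), state cost q(s), and desirability function z(s) = \exp(-v(s)), the optimal cost-to-go satisfies a linear equation and the optimal policy is a reweighting of the passive dynamics:

z(s) = \exp(-q(s)) \sum_{s'} \bar{p}(s'|s) \, z(s'), \qquad \pi^{*}(s'|s) = \frac{\bar{p}(s'|s) \, z(s')}{\sum_{s''} \bar{p}(s''|s) \, z(s'')}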

Expected Policy Gradients for Reinforcement Learning

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussi...
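
The following is a minimal numerical sketch of this idea, not the authors' code: for a 1-D Gaussian policy and a hypothetical fixed critic q_hat, it contrasts a single-sample stochastic policy gradient estimate with an estimate that integrates over the action (the EPG paper itself derives analytic, quadrature-free results for the Gaussian case).

import numpy as np

rng = np.random.default_rng(0)

def q_hat(s, a):
    # Hypothetical critic: a simple quadratic in the action.
    return -(a - 2.0 * s) ** 2

def grad_log_pi(a, mu, sigma):
    # Gradient of log N(a | mu, sigma^2) with respect to the mean mu.
    return (a - mu) / sigma ** 2

s, mu, sigma = 0.5, 0.0, 1.0

# Stochastic policy gradient: one sampled action per estimate (high variance).
a = mu + sigma * rng.standard_normal()
spg_estimate = grad_log_pi(a, mu, sigma) * q_hat(s, a)

# Expected policy gradient: integrate over the action instead of sampling one.
actions = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 2001)
pdf = np.exp(-0.5 * ((actions - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
epg_estimate = np.trapz(pdf * grad_log_pi(actions, mu, sigma) * q_hat(s, actions), actions)

print("single-sample SPG:", spg_estimate, "integrated EPG:", epg_estimate)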

Bridging the Gap Between Value and Policy Based Reinforcement Learning

We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values satisfy a strong consistency property with optimal entropy regularized policy probabilities along any action sequence...
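
The consistency property mentioned here can be sketched as follows (single-step form under assumed notation, with entropy temperature \tau and deterministic dynamics s' = f(s, a)); notably, it holds for any action a, not only on-policy samples:

V^{*}(s) - \gamma V^{*}(s') = r(s, a) - \tau \log \pi^{*}(a|s)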

Using Policy Gradients to Account for Changes in Behaviour Policies under Off-policy Control

Off-policy learning refers to the problem of learning the value function of one policy, the target policy, while selecting actions with a different behaviour policy. Gradient-based off-policy learning algorithms, such as GTD (Sutton et al., 2009b) and TDC/GQ (Sutton et al., 2009a), converge when selecting actions with a fixed policy even when using function approximation and incremental updates. In control prob...
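
As a rough illustration of the gradient-TD family mentioned here (a sketch under assumed notation, not code from the paper), a TDC-style update for off-policy linear value estimation with an importance-sampling ratio rho might look like:

import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho, gamma=0.99, alpha=0.01, beta=0.05):
    """One TDC-style off-policy update with linear function approximation.

    theta    : main weights (value estimate is theta @ phi)
    w        : auxiliary weights estimating the expected TD error
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio pi(a|s) / b(a|s)
    """
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (w @ phi))
    w = w + beta * rho * (delta - w @ phi) * phi
    return theta, w

# Example usage on a toy 4-feature problem.
theta, w = np.zeros(4), np.zeros(4)
theta, w = tdc_update(theta, w, phi=np.array([1.0, 0, 0, 0]),
                      phi_next=np.array([0, 1.0, 0, 0]), reward=1.0, rho=1.2)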

Journal:
  • CoRR

Volume: abs/1703.02102

Publication year: 2017